ABSTRACT

A pattern classification scheme which is grounded in classical probability
theory may be associated with confidence intervals that represent an
estimate of the predictive capability of the scheme. As a practical matter,
realistic allocations of data acquisition and processing resources may
severely constrain acceptable levels of predictability. Motivated by pattern
classification techniques which represent radical departures from standard
statistical hypothesis testing, such as those based on fuzzy logic, we
examine some of the basic assumptions which underlie the standard techniques
and briefly discuss their justification or usefulness. In particular, we
show that fuzzy logic effectively produces conservative estimates for the
conditional probability of the union of sets since, in that case, it
neglects information related to the intersection. We propose that such
neglect can be remedied, at a computational cost, without resorting
explicitly to the usual procedure of integrating over irregularly shaped
volumes. To this end, but only by way of extended example, we introduce a
class of probability density distributions which, under conditions developed
in the paper, possesses (hyper)rectangular contour. Explicit formulas for
the normalization constant and the probability of error are then derived for
typical distributions. By the introduction of suitable formal
approximations, the binary classification problem is solved for two sample
distributions of truncated domain and differing roll-off. Finally, liberties
taken with the additive property of density functions permit the realization
of a piecewise linear discrimination logic for non-rectangular contours. The
result is a sample classification scheme that avoids some of the
computational expense or unduly constraining assumptions implied in the
derivation of classical discriminants. At the same time it takes some
rational advantage of all the information available to the analyst. The
suggestion is implicit that the methods demonstrated in the example are
applicable to diverse situations which provide widely varying information as
to domain, roll-off, and contour.

ADMINISTRATIVE INFORMATION


The basic ideas for distributions of rectangular contour were conceived in
1970 when the author was at the Technion - Israel Institute of Technology
under a post-doctoral fellowship. An implementation of one of the
classification algorithms was realized successfully in a speech recognition
project at NSRDC in 1973 for NAVSEA 0311, Task Area SR0140301, Task 16565,
Element 61153N. In 1977, the subject was re-considered, sharpened,
generalized, and is here documented for NAVSEA 03F, Task Area SR0140301,
Task 15321, Element 61153N.

INTRODUCTION

PREDICTABILITY, EFFICIENCY

This report describes a scheme for discriminating between patterns drawn
from distributions whose isoprobable density contours are hyperpolygonal
surfaces. However, an underlying purpose of this introduction is not so much
the motivation of yet another classification algorithm as it is the
discussion and clarification of some fundamental issues in pattern
classification for which the proposed algorithm is but an example of a
remedy.

Current trends in hypothesis testing as applied to pattern classification
exhibit certain characteristics which this work attempts to identify and
exploit. If one were to plot the accuracy of prediction against the cost of
data acquisition and processing, a graph like that in Figure 1 might be
expected.

[Figure 1 - Form of Accuracy versus Cost; vertical axis: accuracy,
horizontal axis: cost]

The graph shows two breakpoints: the first (A) at the start of an interval
in which enough data have been assembled to initiate the structural
approximation of the density distributions under test; the second (B)
indicating the onset of saturation, a region of diminishing return
corresponding to the limiting convergence property of a law of large
numbers. The lower end of the curve suggests that either not enough data
have been collected to reveal the structure of the sample distributions or
that the model chosen does not represent a large enough investment in data
processing capability to take advantage of the fine information structure
available in the data that have been collected. The gain region indicates an
area in which sufficient data are available and one has room to trade off
models of increasing complexity and (hopefully) accuracy against increasing
cost. Finally, the saturation region reflects our axiomatic belief in the
limit nature of frequency distributions, as reflected, for example, in a
weak law of large numbers. This region is the least sensitive to model
trade-offs, although it can certainly be depressed by a sufficiently poor
choice of model. Attempts have been made both to measure and to improve
accuracy at reduced cost in the gain region between A and B. Measurement,
for example, might take the form of calculating confidence intervals, albeit
for an arbitrarily chosen distribution, often the Normal. Similarly,
improvement of accuracy at reduced cost might involve assuming distributions
whose resulting discriminant requires relatively few computations. On the
other hand, choosing a parametric model which seriously misrepresented the
data could result in a decrease in accuracy for an increase in samples. Such
a model would fall outside the class of decision surfaces represented by
Figure 1. There are non-parametric models that will take advantage of all
the information gathered from the viewpoint of being able to separate given
data sets, but that offer little explicit rationale for forecasting new
data. On the other hand, many parametric models, which represent a best
guess as to the eventual distribution of points or a statistic of the
distribution, are effective only in the gain and saturation regions and
require sizable amounts of data to be meaningful, discounting prior
engineering knowledge.

Even the axioms of probability theory are not impervious to distortion -
intentionally or not - for the sake of reducing cost. In a sense, some
non-parametric models represent a tacit option for computational convenience
and cost reduction to the neglect of the axiom that requires the convergence
of frequency distribution to a limit function. Some parametric models
acknowledge the axiom by assumption but make no serious effort to
empirically justify the assumption on the basis of the available data, again
for the sake of computational convenience. Models based on potential
functions which are related to clustering procedures and nearest-neighbor
techniques present an interesting compromise between classical parametric
and non-parametric models in the sense that densities are effectively
assumed at each sample point and are independently summed to form an
effective, overall density distribution. The axioms of convergence and of
additivity are both thus manipulated for the sake of efficiency. In this
case, however, it is not clear that arbitrarily assumed local densities can
be effectively extrapolated to determine the asymptotic behavior of the
whole distribution. The independence and usual spherical symmetry of the
assumed densities may especially be called into question, but even if these
assumptions are accepted, the local density should not be expected to do
more than predict where future samples are likely to fall. In other words,
distributions of bounded support are perhaps more realistic assumptions and
they will be accordingly addressed later in this report.
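
The summing of independently assumed local densities can be sketched as
follows. This is only a minimal illustration and not the method of any
particular reference: the triangular kernel, the half-width h, and the
function names are assumptions made here, the kernel being chosen because it
has bounded support.

```python
def triangular_kernel(u):
    """Local density of bounded support: zero outside |u| < 1, unit area."""
    return max(0.0, 1.0 - abs(u))


def potential_density(x, samples, h=1.0):
    """Overall density at x as the average of local densities centered on
    each sample point (h is an assumed kernel half-width)."""
    return sum(triangular_kernel((x - s) / h) for s in samples) / (len(samples) * h)


samples = [0.0, 1.0]                      # hypothetical one-dimensional samples
print(potential_density(0.5, samples))    # both local densities contribute
print(potential_density(3.0, samples))    # outside every local support
```

A point far from every sample receives zero density, which is the sense in
which such a model predicts only where future samples are likely to fall.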

As in the case of potential functions, an axiom of probability that is
sometimes tacitly modified is that of additivity. For example, the
application of fuzzy set theory to pattern classification can be regarded in
practice - theoretical protests notwithstanding - as a theory derived from
the axioms of probability with the modification of the axiom of additivity.
This subject will be reviewed further on. Indeed, a basic aim of this report
will be to show how fundamental assertions about probability can be
reasonably modified and yet yield classification algorithms with predictive
power.

Basically, it is argued here that the two principal functional objectives of
a pattern classification scheme are

     o predictability
     o computational efficiency

and that, while the theoretical foundations for prediction are fragile, the
knowledge and predilection for computational efficiency are robust. Indeed,
even if one knew of methods that realized a strong relationship between past
and future events, the financial limitations of any particular application
would force a trade-off between the amount of data that could be collected -
hence, the predictability - and the amount of human or computer time
available to process the data. As matters stand, many experiments are by
nature data-limited, and no amount of processing justifies the confidence
that is often placed in forecasts that are derived from the data. On the
other hand, one should not jump to the extreme conclusion that the gentle
skepticism with which we review some of the basic assumptions underlying
both hypothesis testing and the concept of probability itself implies that
classification techniques are of no value. Quite the contrary, the notion
that the future lies in the past, even when sketchily described, is the
essential ingredient in the human understanding of, or at least belief in,
causality and the meaning of experience. The ideas in this report do not
advocate a break with the empirical tradition of letting the results of
experiment determine whether assumptions about the statistical mechanisms
used to describe a process are justified or need revision. All that will be
suggested is that, on the one hand, since the reliability of forecasting
techniques is ill-founded in principle and sufficient data are often not
available in fact, one is entitled to indulge a bent toward assumptions
about nature that permit convenience in computing tests of prediction; on
the other hand, these assumptions need not be made rigidly and blindly, but
rather should take into account engineering knowledge or biases about the
process in question.

CONVERGENCE,  ADDITIVITY 

At the root of any attempt to predict the occurrence of events by an
application of probability theory is the idea that the bounded relative
frequency of occurrence is a point set function that will converge with
sample size in the limit to a point set measure called a probability.

That is, for sample size n drawn say from the real line, a sequence
F_n(x<X) is hypothesized which monotonically increases with X and converges
with n to some P(x<X). Alternatively, frequency distributions f_n(x) might
be considered which converge to a probability density p(x). The probability
of the random event A \subset R^n is, roughly speaking, the value of a real,
non-negative, additive set function P(A) whose domain is a Borel field of
sets B and which is normalized in the sense that P(U_B) = 1, where U_B is
the unit element of B. The two jarring notes in the applied probability
scheme of things are convergence and additivity. While we have intuitive
explanations available to bolster our belief in the efficacy of those
assumptions, our desire for theoretical elegance or computational
convenience may be in fact the dominating prejudice. Moreover, the evidence
as to whether limited experiments of the coin-tossing or even statistical
mechanics variety are consistent with the assumptions noted does not provide
us with a logically tight case for employing the assumptions in other
problem areas, especially when there are severe restrictions on the amount
of data to be gathered. The possibilities that frequency distributions might
radically change their form, or parameters their value, as more data are
accumulated, the weak law of large numbers notwithstanding, or that, for
mutually exclusive events A, B, the equation

     P(A \cup B) = P(A) + P(B)                                          (1)

may not hold, are considerations that temper our dogmatic attitude toward
the peculiar representation of the future offered by probability theory.
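
As a small illustration of the additivity assumption, the relative frequency
computed from a fixed sample is trivially additive over mutually exclusive
events, as Equation (1) requires of its limit; nothing in a finite sample,
however, guarantees the convergence. The observations and the function names
below are hypothetical.

```python
def relative_frequency(sample, event):
    """Fraction of sample points for which the predicate `event` holds."""
    return sum(1 for x in sample if event(x)) / len(sample)


sample = [0.2, 1.7, 0.9, 2.4, 1.1, 0.4, 2.9, 1.5]    # made-up observations


def in_a(x):
    return x < 1.0                  # event A


def in_b(x):
    return x >= 2.0                 # event B, mutually exclusive with A


def in_union(x):
    return in_a(x) or in_b(x)


f_a = relative_frequency(sample, in_a)
f_b = relative_frequency(sample, in_b)
f_union = relative_frequency(sample, in_union)
print(f_a, f_b, f_union)            # the third value is the sum of the first two
```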

Indeed, in recent years at least one theory, namely Zadeh's fuzzy set
theory, has challenged these assumptions and offered - with
reservation^{3,4} - a weakening of the assumptions and hence an apparently
flexible means of describing the persistence of trends or at least our
intuition about them. For example, in lieu of a conditional probability
limit function, we may speak of a membership function \mu(A|x) that
describes the degree to which an element x is a member of a set A. The
empirical means by which we capture the function is left unspecified,
although it is not unreasonable that a frequency distribution might be an
adequate means of expressing the normative quality of concepts regarded as
fuzzy - like young. On the other hand, we are free to cast the quantitative
expression of a membership function in the world of our subjective,
engineering judgment.

With respect to the assumption of additivity, first note how inclusion and
union are calculated in the theory of fuzzy sets. A is said to be included
in B, A \subset B, if \forall x, \mu(A|x) \le \mu(B|x). The union of two
fuzzy subsets A, B - each of which contains by definition every element of
the universe with some non-negative degree of membership - is defined to be
the smallest fuzzy subset containing both A and B. The resulting membership
function is calculated by the equation

     \mu(A \cup B|x) = \mu(A|x) \vee \mu(B|x)
                     = \max[\mu(A|x), \mu(B|x)]                         (2)

The similarity to Equation (1) is apparent. However, the estimate of
membership in Equation (2) is conservative - as will be shown - since the
contribution of the subset in which x is a member to a smaller degree is
disregarded. This neglect arises properly only for those x whose degree of
membership is neither zero nor one in either of the joined classes. To the
extent that the degree is zero for some x, the subsets - in a non-fuzzy
sense - may be considered mutually exclusive. In order to take the
overlapping of subsets into account, Equation (1) viewed as a conditional
probability would read

     P(A \cup B|x) = P(A|x) + P(B|x) - P(A \cap B|x)                    (3)

in order to eliminate the double-counting in the intersection which occurs
because of the additivity assumption. One can now see that the fuzzy
estimate is indeed conservative since, if for example
\mu(A \cup B|x) = P(A|x), Equation (3) and the fact that
P(B|x) \ge P(A \cap B|x) imply that \mu(A \cup B|x) \le P(A \cup B|x).
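
The comparison between Equations (2) and (3) can be sketched numerically.
The conditional probabilities used below are hypothetical, and the function
names are illustrative only; the consistency check on the intersection is
the usual bound max(0, P(A) + P(B) - 1) <= P(A and B) <= min(P(A), P(B)).

```python
def fuzzy_union(p_a, p_b):
    """Equation (2): the degree of the union is the larger of the two degrees."""
    return max(p_a, p_b)


def probability_union(p_a, p_b, p_ab):
    """Equation (3): add the two probabilities and remove the intersection once."""
    if not max(0.0, p_a + p_b - 1.0) <= p_ab <= min(p_a, p_b):
        raise ValueError("intersection probability is inconsistent with p_a, p_b")
    return p_a + p_b - p_ab


p_a, p_b, p_ab = 0.5, 0.25, 0.125          # hypothetical values at some x
print(fuzzy_union(p_a, p_b))               # fuzzy estimate
print(probability_union(p_a, p_b, p_ab))   # never smaller than the fuzzy estimate
```

The two estimates coincide only when the intersection is as large as it can
be, that is, when one event is contained in the other.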

For those disciplined in probability theory, Equation (3) may seem a more
reasonable way to consider the knowledge gathered from all classes, but
Equation (2) is easier to compute and lends itself well to the construction
of a fuzzy logic devoid of Lebesgue integration over the real line which is
characteristic of probability calculations. A recent paper by Stallings
offers discussion, empirical examples, and references contrasting the fuzzy
set approach with Bayesian statistics. It is interesting to note that a
published comment by Jain on Stallings' paper asserts that fuzzy set
theories may be developed for different axioms of additivity, although, as
the reply of Stallings notes, the axiom used here predominates in the
literature.

A non-trivial analytical example is in order. Consider the situation in
which people of various ages are being judged as to youthfulness and vigor.
Let

     y = young
     v = vigorous
     a = age

Define PWL[X; {(X_i, Y_i)}] as a piecewise-linear curve joining the points
(X_i, Y_i), and let

     P(v,y,a) = K[(1 - a/50) v (\delta(y) S_0 + \delta(y-1) S_1)
                  + 10 S_2 (\delta(y) + \delta(y-1))]                   (4)

where

     0 < v < 10
     y = 0 or 1
     0 < a \le 100
     K = 1/11000
     S_0 = PWL[a; 0,0; 30,0; 50,1; 100,1]
     S_1 = PWL[a; 0,1; 30,1; 50,0; 100,0]
     S_2 = PWL[a; 0,0; 40,.5; 100,1]

and \delta(y) is the Dirac delta function.

If we define \mu(v<V|a) = P(v<V|a) and \mu(y<Y|a) = P(y<Y|a), the
calculation of the degree of the union is simply

     \mu(v<V \cup y<Y|a) = P(v<V|a) \vee P(y<Y|a)

Some elementary integration of the joint density expressed by Equation (4)
elicits the following table:


      a      P(v<5 \cup y<1|a)      \mu(v<5 \cup y<1|a)

     20           9/16                   13/32
     40          65/88                    1/2
     60          27/37                   77/148
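
The table entries can be checked with exact rational arithmetic. The sketch
below assumes the reading of Equation (4) in which the two delta terms
inside each pair of parentheses are added, so that S_0 weights the class
y = 0 and S_1 the class y = 1; the normalization K cancels in the
conditional quantities. The function names are illustrative.

```python
from fractions import Fraction as F

S = {0: [(0, 0), (30, 0), (50, 1), (100, 1)],      # S_0, multiplies delta(y)
     1: [(0, 1), (30, 1), (50, 0), (100, 0)]}      # S_1, multiplies delta(y-1)
S2 = [(0, 0), (40, F(1, 2)), (100, 1)]


def pwl(x, pts):
    """Piecewise-linear curve through the breakpoints pts, evaluated at x."""
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * F(x - x0, x1 - x0)
    raise ValueError("x lies outside the breakpoints")


def mass(a, y, v_max):
    """Integral over 0 <= v <= v_max of the unnormalized density at age a, class y."""
    slope = (1 - F(a, 50)) * pwl(a, S[y])
    level = 10 * pwl(a, S2)
    return level * v_max + slope * F(v_max ** 2, 2)


def union_estimates(a, v_cut=5):
    """Return (probability, fuzzy degree) of the event v < v_cut or y < 1, given a."""
    total = mass(a, 0, 10) + mass(a, 1, 10)
    p_v = (mass(a, 0, v_cut) + mass(a, 1, v_cut)) / total    # P(v < v_cut | a)
    p_y = mass(a, 0, 10) / total                             # P(y < 1 | a)
    p_union = (mass(a, 0, 10) + mass(a, 1, v_cut)) / total   # no double-counting
    return p_union, max(p_v, p_y)


for age in (20, 40, 60):
    print(age, *union_estimates(age))
```

Under the stated assumption the three printed rows agree with the table.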

As expected, the fuzzy estimate is less than the probability. The
interpretation of the results - which are themselves dependent on the
arbitrary choice of the joint density - is that young people are either not
youthful appearing or not very vigorous with probability greater than 1/2
and degree less than 1/2. Intuition might favor the fuzzy estimate. On the
other hand, the interpretation continues, declaring that old people are of
the sort mentioned with probability about 2/3 and degree about 1/2. Here, it
seems that intuition might favor the probability estimate. While it is true
that the issue hinges on the dubious choice of Equation (4), the use of
probabilistic or fuzzy estimates nevertheless has in principle the potential
for counter-intuitive decisions. This phenomenon may be a comment on the
reasonableness of our intuition - not necessarily of the assumptions - and
suggests that conformity of derived results to intuition may not be an
especially good test of the validity of the assumptions. On the other hand,
Kahneman and Tversky - motivated by empirical evidence that people's
evaluation of the probability of events is significantly different than that
derived from the classical assumptions - offer a heuristic measure which
represents the subjective judgment of likelihood and is largely independent
of sample size. Nevertheless, it is debatable whether or not
counter-intuitive findings suffice to justify discarding basic assumptions
about the persistence of trends. Perhaps we should rather be satisfied that
confidence in the structure of distributions can be determined only
empirically after the fact.


SYMMETRY,  UNIMODALITY,  NORMALITY 

Even if one attributes predictive potential to a collection of pattern
vectors in the sense that their distribution approximates an unspecified
limit function, calculation of the distribution is often rendered difficult
or meaningless by the choice of grid and the sparseness of the data. As a
result, the analyst takes refuge in positing a distribution and estimating
its parameters or in constructing a distancelike function to characterize
membership in a pattern class - for example, the fuzzy functions mentioned
previously. These efforts to achieve computational convenience may be
influenced by the traditional notion that a pattern class is generated by
the perturbation of a hypothetical standard pattern due to noise or error.

With no prior reason to judge otherwise, the error model suggests a
distribution which is symmetric about one point - the "pattern" - or at
worst along orthogonal axes but at any rate unimodal. As a practical matter,
the ideal standard pattern cannot be determined, and the circumstance of the
mean vector being symmetrically perturbed or being located at the maximum
density is a special case with no logical necessity other than formalistic
elegance or computational convenience.

Still another inference that may be made from the error model, often with
recourse to a central limit theorem, is that the error constitutes a
normally distributed sum of independent random variables \xi_k, each
generated by a hypothetical microprocess. In the peculiar event that the
\xi_k are identically distributed, with distribution F(X) possessing finite
variance \sigma^2,^8 the sum indeed approaches the normal distribution in
the sense that

     \lim_{n \to \infty} P[ \frac{1}{\sigma \sqrt{n}}
         ( \sum_{k=1}^{n} \xi_k - n \int x \, dF(x) ) < x ]
         = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2} \, dz    (5)

Here, F(X) is said to be in the domain of normal attraction of the normal
distribution. However, if F(X) has infinite variance, the limit distribution
may be any of a large class of so-called stable distributions - which
include the unitary, the Cauchy, the Pearson (Type V) among others in
addition to the normal - in a sense similar to Equation (5) [Gnedenko and
Kolmogorov,^8 p. 181]. In fact, for a doubly indexed random variable
\xi_{nk}, the limit distribution of \sum_{k=1}^{m_n} \xi_{nk} - for
arbitrary F_{nk}(X) - may be any of the so-called infinitely divisible
distributions,* which include the stable, the Poisson, and distributions of
finite sums of independent, infinitely divisible random variables. With such
a wide range of assumptions and associated limit distributions available,
the choice of a normal distribution must be influenced by ignorance of the
nature of the fictional microprocesses and by the computational convenience
afforded by facts such as "the sum of independent normal random variables is
itself normal." On the other hand, computing discriminant surfaces and error
probabilities even for multivariate normal distributions is not as easy as
might be desired, and some remedies will be suggested further on.
Computational efficiency aside, the error model, if insisted upon, might be
better interpreted by seeking a function that empirically matched the shape
of the distribution near the mean - roll-off at a distance being
non-critical - rather than arbitrarily choosing the normal.

*A canonical representation of the characteristic function of an infinitely
divisible distribution is given by the Levy-Khintchine formula [Gnedenko and
Kolmogorov,^8 p. 76].




HYPERPLANES  AND  THE  HINGE  PROBLEM 

In order to avoid the computational difficulties of empirically estimating
distributions or of using them, analysts have had recourse to distance,
potential function, correlation, clustering, fuzzy, and nonparametric
sequential interpretations - among others - of the classification problem.
Since each scheme tacitly reflects a class of distributions for which it is
best suited, the scheme in general must perform sub-optimally, and sometimes
poorly. For some classification schemes, the computational efficiency gained
does not compensate for the consequent lack of flexibility in adapting to
practical situations. For example, Sebestyen,^9 p. 44, effectively seeks to
maximize the discriminant which separates the sets F and G with respect to a
clustering of F:

     \delta(f,g) = E[(f-g)^T W (f-g) - (f-\bar{f})^T W (f-\bar{f})]     (6)

over positive semi-definite transformations W subject to the constraint

     \eta(f,g) = E[(f-g)^T W (f-g) - (g-\bar{g})^T W (g-\bar{g})] = K   (7)

where

     E is the ensemble mean operator;
     f, g are random variables in classes F, G with means \bar{f},
          \bar{g}, respectively;
     K is a positive constant.

By setting \delta(x,f) = \delta(x,g) for a test point x, one can see that
the discriminant surface is a hyperplane

     x^T W (\bar{g} - \bar{f})
         = 1/2 (\bar{g}^T W \bar{g} - \bar{f}^T W \bar{f})              (8)

Consider the points

     x = d \bar{f} + (1-d) \bar{g},   (0 < d < 1)                       (9)

on the line connecting the class means. The solution of Equations (8) and
(9) for d reveals that the discriminant hyperplane is hinged at d = 1/2. It
is easy to discover linearly separable distributions for which a hinged
hyperplane, however efficient to calculate, has no orientation which
provides near-optimal or even satisfactory discrimination. One could unhinge
the discriminant hyperplane by reformulating the problem to include a
variable threshold, thus maximizing

     \delta(f,g) - (s/2) (\bar{g} - \bar{f})^T W (\bar{g} - \bar{f})    (10)

subject to the constraint

     \eta(f,g) - (s/2) (\bar{g} - \bar{f})^T W (\bar{g} - \bar{f}) = K  (11)

Whereas Equations (6) and (7) have a straightforward solution, the solution
to Equations (10) and (11) (see Appendix A) requires maximizing the maximum
eigenvalue of the matrix

     (R(g) + R(f) - (s/2) Y) (R(f) + R(g) - (s/2) Y)                    (12)

where

     R(h) = E(h h^T)
     Y = R(\bar{g} - \bar{f})

over s, a challenging computational task. In the light of difficulties of
this sort, one might just as well have assumed two normal distributions, for
which the hyperplane discriminant has no known closed form but may be
calculated by recursive approximation, as indeed may Equation (12).
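
The hinge at d = 1/2 can be checked with a short sketch. It assumes W is
symmetric and that the means differ along a direction not annihilated by W;
the function names and the numerical values are hypothetical. Exact
fractions are used so that the result is not obscured by rounding.

```python
from fractions import Fraction as F


def bilinear(u, W, v):
    """u^T W v for plain Python lists."""
    return sum(u[i] * W[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def hyperplane_residual(x, f_bar, g_bar, W):
    """Left side minus right side of Equation (8) at the point x."""
    diff = [g - f for f, g in zip(f_bar, g_bar)]
    rhs = F(1, 2) * (bilinear(g_bar, W, g_bar) - bilinear(f_bar, W, f_bar))
    return bilinear(x, W, diff) - rhs


def hinge_position(f_bar, g_bar, W):
    """Value of d at which x = d*f_bar + (1-d)*g_bar lies on the hyperplane."""
    r0 = hyperplane_residual(g_bar, f_bar, g_bar, W)   # d = 0
    r1 = hyperplane_residual(f_bar, f_bar, g_bar, W)   # d = 1
    return r0 / (r0 - r1)                              # residual is linear in d


f_bar, g_bar = [1, -1], [4, 2]                         # hypothetical class means
for W in ([[2, 1], [1, 3]], [[1, 0], [0, 5]]):         # two symmetric choices of W
    print(hinge_position(f_bar, g_bar, W))
```

Whatever symmetric W is chosen, the hyperplane crosses the line of means at
its midpoint; only its orientation responds to W.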


SUMMARY  OF  THE  ARGUMENT 

From the preceding discussion, one should have received the impression that
statistical decisions are often bigger gambles than the numbers indicate
because of the arbitrariness with which one chooses the assumptions. Since
predictability is such a slippery notion, analysts have locked themselves
into simplifying assumptions about empirical processes, ranging from
additivity of measures to hinged discriminant hyperplanes, all in the name
of computational efficiency. In the discussion to follow, an approach to
choosing assumptions is introduced for which the computational penalty is
not excessive, and all information is used in some intuitively agreeable
sense. The method will offer the flexibility - and arbitrary complexity - of
determining or ignoring dissymmetry, polymodality, and roll-off while
relaxing the additivity assumption and dealing with unhinged, arbitrarily
oriented, piecewise linear surfaces. The proposed technique has its
limitations and cannot be unique in an arena of choices dictated primarily
by intuitive agreement, but it may nevertheless be a contribution to a means
for exploratory data analysis and offers an opportunity to escape from the
confining assumptions inherent in statistical handbooks.


DISTRIBUTIONS OF RECTANGULAR CONTOUR


FORMS  OF  THE  DISTRIBUTIONS 

The density distributions discussed in this chapter are of
hyperrectangular isoprobabilistic contour. These contours are closed,
concentric surfaces which circumscribe hyperellipsoids whose principal
semi-axes are proportional to the square roots of the eigenvalues of a
covariance matrix $Q_i$ and lie along the corresponding eigenvectors of
$Q_i$. If the covariance matrices $Q_1$, $Q_2$ are positive definite, they
can be simultaneously diagonalized by a transformation $A$ such that

    $A Q_1 A^T = I$  and  $A Q_2 A^T = \Lambda$    (13)

where $\Lambda$ is the eigenvalue matrix of $Q_1^{-1} Q_2$. Therefore, in
the case of binary discrimination, it suffices to discuss two distributions
in $R^n$, namely: $P_1(\mathbf{x})$, whose mean is zero and whose
isoprobabilistic contours are hypercubes; and
$P_2(\mathbf{x} - \bar{\mathbf{x}})$, whose mean is $\bar{\mathbf{x}}$, whose
isoprobabilistic contours are hyperrectangular, and whose principal
semi-axes are parallel to those of $P_1$.
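
The simultaneous diagonalization of Equation (13) can be illustrated with a
minimal sketch for the two-dimensional case. The approach below (a Cholesky
factor of $Q_1$ followed by a rotation to the eigenvectors of the whitened
$Q_2$) is one standard construction and is an assumption of this example,
not necessarily the procedure used in the report; the function names are
hypothetical.

```python
import math

def matmul(X, Y):
    """Product of two small matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    """Transpose of a nested-list matrix."""
    return [list(row) for row in zip(*X)]

def simultaneous_diagonalize_2x2(Q1, Q2):
    """Return (A, lams) with A Q1 A^T = I and A Q2 A^T = diag(lams).

    Assumes Q1 and Q2 are symmetric 2x2 matrices and Q1 is positive definite.
    """
    a, b, c = Q1[0][0], Q1[0][1], Q1[1][1]
    # Cholesky factor Q1 = L L^T, written out for the 2x2 case.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    Linv = [[1.0 / l11, 0.0],
            [-l21 / (l11 * l22), 1.0 / l22]]
    # Whiten Q1, then rotate onto the eigenvectors of the whitened Q2.
    M = matmul(matmul(Linv, Q2), transpose(Linv))
    theta = 0.5 * math.atan2(2.0 * M[0][1], M[0][0] - M[1][1])
    cs, sn = math.cos(theta), math.sin(theta)
    Ut = [[cs, sn], [-sn, cs]]  # rows are the eigenvectors of M
    A = matmul(Ut, Linv)
    D = matmul(matmul(A, Q2), transpose(A))
    return A, [D[0][0], D[1][1]]
```

The returned eigenvalues are those of $Q_1^{-1} Q_2$, so their product and
sum can be checked against the determinant and trace of that matrix.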

The  hyperrectangular  contours  of  the  distributions  to  be  analyzed 
are  based  on  the  following  proposition: 

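PROPOSITION. A hypercube may be represented by the equation

    $\lim_{m\to\infty} \sum_j \theta_j \bigl( \sum_i |x_i|^m + \beta_j^m \bigr)^{h/m} = \sum_j \theta_j \rho_j^h$    (14)

where $\rho_j = \max\{\lambda, \beta_j\}$, and $\theta_j$, $\lambda$,
$\beta_j$ are arbitrary.

Proof. For $x_i = \phi_i \lambda$ with $|\phi_i| \le 1$, except that
$|\phi_i| = 1$ for at least one $i$, the left-hand side of Equation (14)
becomes

    $\lim_{m\to\infty} \sum_j \theta_j \bigl( N \lambda^m + \beta_j^m \bigr)^{h/m}$    (15)

where N is the number of $|\phi_i|$ equal to one. When $\lambda > \beta_j$,
the parenthesized expression in Equation (15) yields

    $\lim_{m\to\infty} \lambda^h \bigl( N + (\beta_j/\lambda)^m \bigr)^{h/m} = \lambda^h$    (16)

and when $\lambda \le \beta_j$, the parenthesized expression in Equation
(15) reduces to $\beta_j^h$. END OF PROOF.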
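
A small numerical illustration may help here. It assumes that the
Proposition rests on the familiar limit in which an m-th power sum, raised
to the power 1/m, tends to the largest of its terms, so that its level sets
become hypercubes with a flat region of half-width beta. The function names
and the sample values are hypothetical.

```python
def soft_max_norm(x, beta, m):
    """(sum_i |x_i|^m + beta^m)^(1/m) for a finite exponent m."""
    return (sum(abs(v) ** m for v in x) + beta ** m) ** (1.0 / m)

def hard_max_norm(x, beta):
    """Limiting value as m grows: max(max_i |x_i|, beta)."""
    return max(max(abs(v) for v in x), beta)

if __name__ == "__main__":
    point = [0.3, -0.9, 0.5]
    for m in (2, 8, 32, 128):
        # The finite-m value approaches the limiting value from above.
        print(m, soft_max_norm(point, 0.2, m), hard_max_norm(point, 0.2))
```

For moderate m the contours are rounded squares; the corners sharpen as m
increases.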


The power of the Proposition is that it permits the synthesis of
symmetric distributions with roll-offs of extensive complexity subject to a
finite volume constraint, as the following examples demonstrate.

An exponential squared (i.e., normal) roll-off may be achieved by the
distribution

    $p_a(\mathbf{x}) = \lim_{m\to\infty} \kappa_a \exp\bigl[ -\tfrac{1}{2} \bigl( \sum_i (x_i^2/\lambda_i)^m \bigr)^{1/m} \bigr]$    (17)


By letting

    $\beta_1 = 1$, $\theta_1 = 1$, $\beta_j = \theta_j = 0$ ($j \ne 1$)

one sees that the Proposition is satisfied and the contours are
consequently hyperrectangular.

The distribution has finite volume and may be normalized as follows. Let

    $\mathbf{z} = \Lambda^{-1/2} \mathbf{x}$.

Then Equation (17) implies that

    $\lim_{m\to\infty} \bigl( \sum_i (z_i^2)^m \bigr)^{1/m} = -2 \ln\bigl( p_a \kappa_a^{-1} \det\Lambda^{-1/2} \bigr)$    (18)

which is the equation of a hypercube. If $z_j = 0$ ($j \ne i$), then the
squared semi-axis $z_i^{*2}$ is given by the right-hand side of Equation
(18) and the integral of Equation (17) is calculated by the expression

    $\int_0^{\Delta} \prod_i (2 z_i^*)\, dp_a = \int_0^{\Delta} 2^n \bigl( 2 \ln(p_a^{-1} \Delta) \bigr)^{n/2}\, dp_a$    (19)

where $\Delta = \kappa_a \det\Lambda^{1/2}$. Let $q = p_a \Delta^{-1}$. Then
the normalization integral (19) becomes

    $\Delta\, 2^{3n/2} \int_0^1 (\ln q^{-1})^{n/2}\, dq = 1$.    (20)

The solution to Equation (20) (Gradshteyn and Ryzhik,11 p. 525, 4.215.1)
gives the result

    $\kappa_a = 2^{-3n/2} \det\Lambda^{-1/2} / \Gamma(1 + n/2)$    (21)

The fact that the elliptically-contoured normal density distribution is
simply approximated by the rectangularly-contoured normal distribution
through a change of eigenvalues (see Appendix B) suggests that any
computational difficulty with one distribution will not be alleviated for
the other. Moreover, since the requirement for finite volume is met by a
normal roll-off on the order of $\exp[-x^2]$, but not by one as slow as
$x^{-n}$ for non-zero tails, we will resort to distributions of finite
support to achieve other roll-offs. These distributions with truncated
roll-off exhibit computational advantages and are in accord with the
physical intuition that all measurements of real phenomena fall within a
finite distance of the class mean. Before discussing the truncated case,
however, it will be helpful to display a density distribution which not
only illustrates the necessity for fast roll-off when non-zero tails are
included, but also serves as a means of generating distributions with
truncated roll-off. The density

    $p_b(\mathbf{x}) = \lim_{m\to\infty} \kappa_b \bigl( \sum_i (x_i^2/\lambda_i)^m + \beta^m \bigr)^{-h/m}$    (22)

is normalized by means of the integral

    $\int_0^{\kappa_b \beta^{-h}} 2^n \det\Lambda^{1/2}\, (p_b/\kappa_b)^{-n/(2h)}\, dp_b = 1$    (23)

which, for $2h > n$, yields

    $\kappa_b = 2^{-n} \det\Lambda^{-1/2}\, \bigl(1 - n/(2h)\bigr)\, \beta^{h - n/2}$    (24)

The restriction $2h > n$ implies that the roll-off must be faster than
$x^{-n}$, as mentioned previously. Slower roll-off for distributions with
truncated roll-off can be achieved by carefully combining densities of the
form given by Equation (22). The following density is an example of such a
combination.


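    $p_c(\mathbf{x}) = \lim_{m\to\infty} \kappa_c \bigl[ \bigl( \sum_i (x_i^2/\lambda_i)^m + \beta_1^m \bigr)^{-h/m} - \bigl( \sum_i (x_i^2/\lambda_i)^m + \beta_2^m \bigr)^{-h/m} \bigr]$    (25)

Since the distribution form satisfies the conditions of the Proposition,
the contours are hyperrectangular. Suppose that $\beta_2 > \beta_1$. The
normalization proceeds as before, producing the equation

    $\int_0^{\kappa_c (\beta_1^{-h} - \beta_2^{-h})} 2^n \det\Lambda^{1/2}\, \bigl( p_c/\kappa_c + \beta_2^{-h} \bigr)^{-n/(2h)}\, dp_c = 1$    (26)

whose solution is

    $\kappa_c = 2^{-n} \det\Lambda^{-1/2}\, d / \bigl( h (\beta_2^d - \beta_1^d) \bigr)$    (27)

where $d = (n/2) - h$ unless $h = n/2$, in which case the constant is

    $\kappa_c = 2^{-n} \det\Lambda^{-1/2} / \bigl( (n/2) \ln(\beta_2/\beta_1) \bigr)$    (28)

Except for the singular point $h = 0$, useful solutions exist for all
non-negative h. Moreover, when h is negative, solutions exist for the
condition $\beta_1 > \beta_2$. The formulation (25) will generate, for
example, a truncated linear roll-off for $h = -1/2$, $\beta_1 > \beta_2$,
and a truncated quadratic roll-off for $h = -1$, $\beta_1 > \beta_2$. More
complex profiles can be achieved by a judicious combination of forms
derived from Equation (25).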
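
The truncated density can be checked numerically in the same spirit. The
sketch below assumes the limiting (max-norm) form of the truncated profile
in the diagonalized space with unit determinant, evaluates the closed-form
constant including the special case h = n/2, and compares it with a shell
integration. All names are hypothetical and the quadrature is a choice made
for this example.

```python
import math

def kappa_c(n, h, beta1, beta2, det_sqrt=1.0):
    """Closed-form constant for the truncated density, with the h = n/2 case."""
    if abs(h - n / 2.0) < 1e-12:
        return 1.0 / (2.0 ** n * det_sqrt * (n / 2.0) * math.log(beta2 / beta1))
    d = n / 2.0 - h
    return d / (2.0 ** n * det_sqrt * h * (beta2 ** d - beta1 ** d))

def profile_c(u, h, beta1, beta2):
    """Unnormalized profile as a function of u = max_i z_i^2 (limit m -> infinity)."""
    return max(u, beta1) ** (-h) - max(u, beta2) ** (-h)

def mass_c(n, h, beta1, beta2, steps=200000):
    """Midpoint-rule integral of profile_c over cubic shells of half-side r."""
    r_max = math.sqrt(max(beta1, beta2))  # the profile vanishes beyond this
    dr = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr
        total += profile_c(r * r, h, beta1, beta2) * n * 2.0 ** n * r ** (n - 1) * dr
    return total

if __name__ == "__main__":
    # Positive h with beta2 > beta1, and negative h with beta1 > beta2.
    for h, b1, b2 in ((0.5, 1.0, 4.0), (1.0, 1.0, 4.0), (-0.5, 4.0, 1.0)):
        print(h, kappa_c(2, h, b1, b2) * mass_c(2, h, b1, b2))
```

The third case corresponds to the truncated linear roll-off: with
h = -1/2 the profile falls linearly in the scaled radius between the two
breakpoints.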

ERROR  EXPRESSIONS 

Due to the simple geometry associated with rectangular contours, the
error induced by partitioning a measurement space with the discriminant
hyperplane

    $\mathbf{s}^T \mathbf{x} = t$    (29)

can be easily calculated without reference to the analytic forms of the
densities by reducing the n-dimensional volume integral to a
one-dimensional integral along density, as was done for the normalization
constant. Although the resultant error expression is not amenable to linear
optimization, the expression itself serves as an intuitive aid in
determining sub-optimal discriminants. The deliberate attempt to find an
optimal hyperplane by minimizing the probability of error leads to more
computational difficulties than some approaches involving other criterion
functions, like the mean-square error (MSE). However, those approaches
produce discriminants that may not be good approximations to the
discriminant minimizing the probability of error, which is regarded as the
fundamental criterion. For example, since the MSE hyperplane, which is an
asymptotic minimum MSE approximation to the optimal (nonlinear) Bayes
discriminant, is related to the density magnitude rather than to the
distance from the decision surface, the MSE does not necessarily vary with
the probability of error. The decision to deal, then, with the probability
of error is noteworthy and made possible by the fact that explicit
expressions are available, at least in some cases of importance, for the
probability of error. Without such explicit expressions, placing bounds
on error would be the only means of formulating an optimum hyperplane.
Haralick, in considering the Bayes classification problem for a class
of distributions which include those discussed in this report, develops
lower bounds on the probability of error due to the difficult integration
associated with a domain bounded by a hyperplane discontinuity.

In the following discussion, the density p will be presumed to vary
inversely with increasing values of $|\mathbf{x}|$. Since a density
distribution may be viewed as the superposition of a set of unimodal
densities, the assumption is not as restrictive as it may appear, at least
not for the well-behaved densities one believes exist in practice. In any
event, the overall strategy of derivation would not differ appreciably for
other p(x).

As seen in Figure 2 (shown for the case of two dimensions and for a
given hyperplane $\mathbf{s}^T \mathbf{x} = t$), as the isoprobable
contours expand, they meet the hyperplane at $2^{n-1}$ vertices

    $\mathbf{x}_i = \alpha_i (\pm\lambda_1^{1/2}, \ldots, \pm\lambda_n^{1/2})^T = \alpha_i \boldsymbol{\eta}_i$    (30)

where the signs are chosen so that

    $\mathbf{s}^T \boldsymbol{\eta}_i < \mathbf{s}^T \boldsymbol{\eta}_j$    (31)

and where

    $\boldsymbol{\eta}_i = (\pm\lambda_1^{1/2}, \ldots, \pm\lambda_n^{1/2})^T$

    $\alpha_i = t / \mathbf{s}^T \boldsymbol{\eta}_i$    (32)

Here it is presumed that the hyperplane is situated above the mean. A
symmetric argument exists if the hyperplane is located below the mean.

The degeneracy caused by
$\mathbf{s}^T \boldsymbol{\eta}_i = \mathbf{s}^T \boldsymbol{\eta}_j$ in
Equation (31) will be discussed later. From Figure 2, one can see that, at
a given x, the error consists of a weighted triangular area ABC or its
n-dimensional right pyramid equivalent, less a similar weighted area CDE
(for large enough x) for each vertex x above the hyperplane. The error
expression for any of the areas (volumes) mentioned has the expansion


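    $\int_{V=0}^{V} p\, d\mathbf{x} = pV - \int_{p(V=0)}^{p} V\, dp$    (33)

where V is a volume variable. A side of the indicated volume is

    $\Delta x_k - (x_k - x_{ik})$    (34)

where $x_k$ is the intersection of the hyperplane and the $x_k$-axis, given
implicitly by

    $s_k x_k + \sum_{j \ne k} (x_{ij} + \Delta x_j) s_j = t$    (35)

Since $\mathbf{x}_i^T \mathbf{s} = t$, Equation (35) yields

    $x_k - x_{ik} = -\sum_{j \ne k} \Delta x_j s_j / s_k$    (36)

Therefore, Equation (34) is equal to
$(\Delta\mathbf{x}^T \mathbf{s})/s_k$ and the error Equation (33) results
in the expression

    $E_i = \int_0^{p_i'} \frac{(\Delta\mathbf{x}^T \mathbf{s})^n}{n! \prod_j s_j}\, dp$    (37)

where $p_i' = p(\mathbf{x}_i)$.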
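
The geometric step behind the integrand can be sketched as follows,
assuming that the region cut off at a vertex is a right pyramid (simplex)
whose edge along the k-th axis has length equal to the inner product of
the vertex displacement with s, divided by s_k, and that all components of
s are positive. The function names are hypothetical.

```python
import math

def corner_sides(dx, s):
    """Edge lengths (dx . s) / s_k of the right pyramid, one per axis."""
    dot = sum(d * sk for d, sk in zip(dx, s))
    return [dot / sk for sk in s]

def corner_simplex_volume(dx, s):
    """Volume (dx . s)^n / (n! * prod_k s_k) of the corner cut off by the plane."""
    n = len(dx)
    dot = sum(d * sk for d, sk in zip(dx, s))
    return dot ** n / (math.factorial(n) * math.prod(s))

if __name__ == "__main__":
    dx, s = [1.0, 1.0], [1.0, 2.0]
    # In two dimensions the region is a right triangle with these two legs.
    print(corner_sides(dx, s), corner_simplex_volume(dx, s))
```

The volume is the product of the edge lengths divided by n!, which is the
usual formula for a right-angled simplex.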

l —l 

If  the  profile  is  defined  as 


PQ  = pW 


xk=0,Ml 


= f(xx) 


X = f X(Pq) 


\ - +>VIk  f <v  * ^7  "ikf'1(Po>  <«> 

where  the  signs  for  are  chosen  to  conform  with  the  appropriate  vertex 
as  indicated  by  Equations  (30)  and  (31).  When  Equation  (40)  is  introduced 
into  the  error  expression  (37),  the  latter  becomes 


nmvi 


i = nil  s.  JQ 


(41) 


in  general,  then,  the  total  error  expression  for  densities  not  truncated 
before  extending  the  2m  ^ th  vertex  beyond  the  hyperplane  is 

-m-1 


(42) 


where  6^  is  the  Kronecker  delta. 

Discrimination between two distributions whose means are separated by
the vector $\boldsymbol{\delta}$ requires the coordinate transformation

    $\mathbf{x}' = -(\mathbf{x} + \boldsymbol{\delta})$    (43)

in order to retain the form of the preceding error expressions for the
distribution which is displaced from the origin. From the point of view
of such a distribution, the hyperplane has the form


Therefore,  Equation  (41)  holds  for  the  displaced  distribution  if  and  only  if 


which  yields  the  following  solution  for  p^ 

= f ((£T6.-t)//x7  jJn  ) 


if  the  distribution  at  the  origin  has  the  normalized  vertex 
whose  components  are  +1 , the  expression  for  “p  is  then 


With  the  aid  of  the  preceding  formulas,  the  error  expressions  for  the 
previously  defined  distributions  may  now  be  derived.  For  normal  roll-off, 


If one makes the transformation

    $q = (\ln(\kappa_a/p_{a0})^2)^{1/2} - (\ln(\kappa_a/p_i')^2)^{1/2}$    (49)

the error expression (41) becomes

    $E_{ia} = \kappa_a (A_i/n!) \int_0^\infty q^n (q + q_i) \exp\bigl( -(q + q_i)^2/2 \bigr)\, dq$    (50)

where

    $A_i = (\mathbf{s}^T \boldsymbol{\eta}_i)^n / \prod_j s_j$    (51)

    $q_i = (\ln(\kappa_a/p_i')^2)^{1/2}$    (52)

        $= (\mathbf{s}^T \boldsymbol{\delta} - t) / \mathbf{s}^T \boldsymbol{\eta}_i$    (53)

The expanded form of Equation (50) is (cf. Gradshteyn and Ryzhik,11 p. 337,
3.462.1 for the integral and p. 1066, 9.247.1 for the recurrence relation)

    $E_{ia} = \kappa_a A_i \exp(-q_i^2/4)\, D_{-n}(q_i)$    (54)

where $D_{-n}(\cdot)$ is a parabolic cylinder function of degree n. Since
$D_{-n}(\cdot)$ can be directly, if laboriously, related to a normal error
function (Gradshteyn and Ryzhik, p. 1067, 9.254.1, 2 and the recurrence
relation just cited), further analysis does not seem profitable (cf.
Anderson and Bahadur10).
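
The relation to the normal error function can be made concrete for small n.
The sketch below assumes the integral in the normal roll-off error
expression has the form of a power of q times (q + q_i) times a shifted
Gaussian on the half-line; it evaluates that integral by a midpoint rule,
checks it against the result of one integration by parts, and compares the
n = 1 case with the complementary error function. The helper names are
hypothetical.

```python
import math

def midpoint(f, upper, steps=200000):
    """Plain midpoint rule on [0, upper]."""
    h = upper / steps
    return sum(f((k + 0.5) * h) for k in range(steps)) * h

def tail_integral(n, q_i, upper=40.0):
    """int_0^inf q^n (q + q_i) exp(-(q + q_i)^2 / 2) dq, truncated at `upper`."""
    return midpoint(
        lambda q: q ** n * (q + q_i) * math.exp(-0.5 * (q + q_i) ** 2), upper)

def reduced_integral(n, q_i, upper=40.0):
    """n * int_0^inf q^(n-1) exp(-(q + q_i)^2 / 2) dq (one integration by parts)."""
    return n * midpoint(
        lambda q: q ** (n - 1) * math.exp(-0.5 * (q + q_i) ** 2), upper)

def normal_tail(q_i):
    """Closed form of the n = 1 case: sqrt(pi/2) * erfc(q_i / sqrt(2))."""
    return math.sqrt(math.pi / 2.0) * math.erfc(q_i / math.sqrt(2.0))

if __name__ == "__main__":
    print(tail_integral(1, 0.7), normal_tail(0.7))
    print(tail_integral(3, 0.7), reduced_integral(3, 0.7))
```

Multiplying the integral by the density constant and the geometric factor,
and dividing by n!, gives the component error under these assumptions.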

By a similar derivation, one can display component error expressions
for $p_b$, $p_c$ as follows:

Elb  - Vn+2h(  (*h-2n-2)  ! ! / (4h)  ! ! ) q^"2*1  (55 ) 

where $k!! = k(k-2)(k-4)\cdots$, and $A_i$, $q_i$ are defined in Equations
(51) and (53). For the truncated distribution $p_c$ with $h > 1$, the
component error expression is

EJC  - ‘A  Jr(?)  (nT(n+2h-j))  [fa)  (-^J  (56) 


For $h < 1$, a similar derivation produces a complicated expression which
will not be detailed here.


In general, the preceding error expressions are not amenable to the
usual minimizing techniques, due not only to the nonlinearity of the
resulting extremal equations, but also to the variety of forms produced
by various boundary constraints. As a trivial example, if the
distributions for two classes (with eigenvectors $\boldsymbol{\epsilon}_i$,
$\boldsymbol{\eta}_i$ respectively) do not overlap, one can find
$\mathbf{s}$, $t$ such that

T 


se±<  t 

< iVt 


(57) 

(58) 


which  imply  that  the  error  is  zero.  A more  useful,  non-trivial  case  is 
specified  by  the  constraints 


■J&2  — T— i > c > 6 1 

Hi)  > t > sT(6-^7  nt) 


(59) 


(60) 


for which the i-th component of error (based on Equation (56)) is

n 

e.  - y 


1 A,)‘ 


(61) 


where 


w. 


1 Hi 

k - 1,2 
k - 1,2 


"k  ~ n+2\“J  » 

Bik  ° 2hkAikKkc/n!Wk’ 

and  y is  arbitrary.  This  constraint  will  be  presumed  to  hold  in  the  rest 
of  the  discussion. 

Since the roll-off of $p_c$ decreases monotonically with density, the
error component can be further refined by extending the analysis of
Anderson and Bahadur10 to the present case. Thus, the expressions for
$\mathbf{s}$, $t$ are:

    $\mathbf{s} = (t_1 I + t_2 \Lambda)^{-1} \boldsymbol{\delta}$

    $t = t_1 \boldsymbol{\delta}^T (t_1 I + t_2 \Lambda)^{-2} \boldsymbol{\delta}$


where


Inserting these formulas into Equation (61) produces a one-parameter
family of equations which appear no easier to minimize than the
generalization of Sebestyen's formulation which was considered in
Equation (12).

In the next section, a sub-optimal solution will be discussed which depends
on the degeneracies presented next.
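
The hyperplane family above is easy to tabulate in the diagonalized space,
where both covariance matrices are diagonal. The sketch below assumes the
first class has identity covariance and zero mean, the second has diagonal
covariance with entries `lams` and mean `delta`, and that the single free
parameter arises from the common choice t2 = 1 - t1; that last choice is an
assumption of this example, as are the function and variable names.

```python
def anderson_bahadur_plane(delta, lams, t1, t2):
    """Return (s, t) with s = (t1*I + t2*Lambda)^-1 delta and
    t = t1 * delta^T (t1*I + t2*Lambda)^-2 delta, for diagonal Lambda."""
    m = [t1 + t2 * lam for lam in lams]  # diagonal of t1*I + t2*Lambda
    s = [d / mi for d, mi in zip(delta, m)]
    t = t1 * sum((d / mi) ** 2 for d, mi in zip(delta, m))
    return s, t

if __name__ == "__main__":
    delta, lams = [2.0, 1.0], [0.5, 3.0]
    for t1 in (0.2, 0.5, 0.8):
        s, t = anderson_bahadur_plane(delta, lams, t1, 1.0 - t1)
        # For 0 < t1 < 1 the plane separates the two means: 0 < t < s . delta
        print(t1, s, t, sum(sk * dk for sk, dk in zip(s, delta)))
```

Each value of t1 gives one candidate hyperplane; the error expression
would then be evaluated along this family.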


DEGENERACIES 

The usual notion of degeneracy applies to the situation in which one
or each of the covariance matrices is singular. In such a case, if the
distributions occupy different subspaces, test points may be classified
with zero probability of error simply by determining whether or not a
component differs from the mean in a direction which is orthogonal to the
reduced subspace of a distribution. If the reduced subspaces are identical,
the problem is handled in the usual manner, since the covariance matrices
will be nonsingular in the reduced subspace.

Another form of degeneracy occurs when the discriminating hyperplane
is parallel to a surface of each distribution. This occurs when one or
more components of $\mathbf{s}$ are zero in the space in which the
covariance matrices are simultaneously diagonalized. This implies, for
example, that
$\mathbf{s}^T \boldsymbol{\eta}_i = \mathbf{s}^T \boldsymbol{\eta}_j$ for
some i, j not equal. With the assumption that
$\mathbf{s}^T = (s_1, s_2, \ldots, s_J, 0, \ldots, 0)$ it is not difficult
to re-derive the formula for the component error of Equation (61). The
result is as follows:
J 


* "u>  ((*r )"’  -(j*f )"’)(-  jjjj*)’] 


are  re-defined  as 

. T T ^ T ^ 

* (‘ 

(66) 

" j-J+lnij/2n  Jj5lSj 

(67) 

2hkAikKkc/J,wk  * kml»2 

(68) 

We  will  make  use  of  degeneracies  to  determine  sub-optimal  solutions  in 
the  following  section. 




DISCRIMINANT  LOGIC 


BINARY  CLASSIFICATION  FOR  RECTANGULAR  CONTOUR 

In the previous section, a formulation for the minimization of error
was presented which did not permit an easy solution. Here, a sub-optimal
formulation is presented which eases the problem somewhat. For each
dimension, consider the hyperplane

    $\mathbf{s}^T \mathbf{x} = t$, where $\mathbf{s}^T = (0, \ldots, 0, s_j, 0, \ldots, 0)$    (69)

The problem is thus reduced to a set of n one-dimensional problems. If we
substitute Equation (69) into the degenerate error component form
specified by Equation (65), differentiate with respect to $\tilde{t}_j$,
where


and set the result to zero, a non-linear equation is generated as
follows:


n+2h„-l 


Equation (70) need not be summed over the vertex set in accordance with
Equation (42), since no vertex is preferred in the one-dimensional
degenerate case. Solutions to Equation (70) may be found either numerically
or, for example, by piecewise linear approximation. However, one must
remember that Equation (70) was derived on the basis of the overlap
constraints expressed by Equations (59) and (60). These constraints permit
the decision hyperplane to be placed in any of seven distinct regions, for
each of which the error component Equation (65) is slightly different.
Nevertheless, Equation (70) has the same form, with different coefficients,
in each region. The regions are determined by the value of $\tilde{t}_j$
relative to the breakpoints


For example, in the region


the error expression is


“^1  ei  1 ci  1 ei 


Ei  - Ei  + Pic(0)(\j^ei  - t4)  n 


(71) 


where $E_i$ is the error component of Equation (65). The added term on the
right of Equation (71) simply implies that

Plc(0>  II  (eiVBT) 
iC  j*l  J 1 


is added to the term in Equation (70). The number of regions can
effectively be reduced to three, indicated by the breakpoints
$[0, \delta_j]$, by introducing the following additional constraints:


\le\,  + o 

>>  tj/e-j 

(6-t^)/n^ 


(72) 

(73) 

(74) 


These constraints focus attention on the competition between roll-offs and
effectively disregard the truncations in both roll-off and density.

Because each side of Equation (70) in the region $[0, \delta_j]$ is
monotonic, one can develop a recurrence relation for the solution (if it
exists) by examining the slope of the functions. At any point
$\tilde{t}_{0j}$ in $[0, \delta_j]$ the lines of slope of the functions
described by the left and right sides, respectively, of Equation (70) are


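    $f_1(\tilde{t}_j) = w\, \tilde{t}_{0j}^{\,w-1}\, \tilde{t}_j + (a_1 - w)\, \tilde{t}_{0j}^{\,w} + b_1$    (75)

    $f_2(\tilde{t}_j) = -v (\delta_j - \tilde{t}_{0j})^{v-1} (\delta_j - \tilde{t}_j) + (v + a_2)(\delta_j - \tilde{t}_{0j})^{v} + b_2$    (76)

The solution to Equations (75) and (76) is the recurrence relation

    $\tilde{t}_{k,j} = \frac{(w - a_1)\, \tilde{t}_{k-1,j}^{\,w} + (v + a_2)(\delta_j - \tilde{t}_{k-1,j})^{v} + (b_2 - b_1) - v \delta_j (\delta_j - \tilde{t}_{k-1,j})^{v-1}}{w\, \tilde{t}_{k-1,j}^{\,w-1} - v (\delta_j - \tilde{t}_{k-1,j})^{v-1}}$    (77)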
The monotonicity of the functions ensures convergence. A check on Equation
(77) is that in the event $w = v$, $b_1 = b_2$, both Equations (77) and
(70) have the same fixed point, namely:

    $\tilde{t}_{kj} = a_2^{1/w}\, \delta_j / (a_1^{1/w} + a_2^{1/w})$    (78)

Finally, note that the necessary and sufficient conditions for a solution
to Equation (70) to exist are

    $0 < b_1 - b_2 < a_2 \delta_j^{\,v}$    (79)

    $0 \le b_2 - b_1 < a_1 \delta_j^{\,w}$    (80)
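
A numerical route to the threshold can be sketched under a stated
assumption about the form of Equation (70). The tangent lines, the fixed
point and the existence conditions above are all consistent with an
equation of the form a1 * t^w + b1 = a2 * (delta - t)^v + b2 on the
interval from 0 to delta, with positive a1, a2, w and v, and that form is
assumed here. The solver below uses plain bisection rather than the
recurrence relation, since bisection needs only the monotonicity of the two
sides; the function name is hypothetical.

```python
def solve_threshold(a1, b1, w, a2, b2, v, delta, iterations=200):
    """Bisection for a1*t**w + b1 = a2*(delta - t)**v + b2 on [0, delta].

    Returns None when the two sides do not cross inside the interval.
    Assumes a1, a2, w, v > 0, so the left side rises and the right side falls.
    """
    def gap(t):
        return a1 * t ** w + b1 - a2 * (delta - t) ** v - b2

    lo, hi = 0.0, delta
    if gap(lo) > 0.0 or gap(hi) < 0.0:
        return None
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # With w = v and b1 = b2 the root should match the closed-form fixed point.
    a1, a2, w, delta = 2.0, 1.0, 2, 1.0
    closed_form = a2 ** (1.0 / w) * delta / (a1 ** (1.0 / w) + a2 ** (1.0 / w))
    print(solve_threshold(a1, 0.0, w, a2, 0.0, w, delta), closed_form)
```

The early return corresponds to the case in which the difference of the
constant terms is too large for the two sides to meet inside the interval.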


The  decoupling  of  dimensions  effected  by  the  degenerate  hyperplanes  of 
Equation  (69)  in  formulating  the  sub-optimal  solution  may  now  be  removed 
by  averaging  the  hyperplanes  over  the  dimensions.  Alternatively,  one  may 
choose  that  hyperplane  intersected  by  the  line  joining  the  centers  of  the 
distributions,  or  one  may  keep  the  piecewise  linear  structure  in  a 
decision  tree. 

BINARY  CLASSIFICATION  FOR  POLYGONAL  CONTOUR 

When faced with the task of analyzing distributions whose isoprobable
contour is more complicated than ellipsoidal or rectangular, one realizes
how well suited those contours are for rectangular co-ordinates; their
minimum and maximum distances from the origin in any subspace lie exactly
and symmetrically on a set of orthogonal axes and are determined (aside
from a scale factor) solely by the distance of the rms density from the
origin, or equivalently, by the eigenvalues of the associated covariance
matrix.

One could construct dependent coordinate systems which would be suitable
for the description of particular contours, but in order to handle
densities of arbitrary contour, recourse must be had to the traditional
selection of grids, frequency counts, and their attendant problems. On the
other hand, by restricting the contours to be concentric and (aside from a
scale factor) invariant in shape, one can extend the theory developed for
rectangular contours in a natural way as outlined in that which follows.

Let each sample be represented in a spherical coordinate system centered
at the mean $\mathbf{m}$. A decomposition of the sample distribution into
distributions of rectangular contour will now be sought. Select a grid
based on an infinite volume solid angle and compute the rms density
within each grid volume element


rms 


It. 


A6^ 


<V.t)2/n  1/2 


During the course of the computation, the opportunity clearly exists for
detecting polymodes and wild points and for estimating fall-off, but these
procedures will not be detailed here. Peak-picking on the inverse rms
density over the grid elements (by a steepest descent technique, for
example) will then determine the magnitude of one maximum radial vector
$\mathbf{v}_j$ for each component distribution, located by definition along
the central ray of a grid element. The inverse rms density in the grid
elements surrounding the element containing the maximal ray may be compared
with $|\mathbf{v}_j|/2$, and those elements which exceed that threshold are
collected to form a half-power-point set. The set of samples thus
effectively identified is now projected symmetrically through the origin at
the mean $\mathbf{m}$. The original and image sets are joined to form a set
whose mean $\mathbf{m}$ (by a symmetry argument) and covariance matrix
$\Sigma_j$ may be computed in some orthogonal co-ordinate system for which
$\mathbf{v}_j$ is a basis vector. A Gram-Schmidt orthonormalization
procedure will, for example, provide the required basis. When
$\mathbf{m}_j$, $\Sigma_j$, and a roll-off parameter $h_j$ have been
computed for each density peak (referred to a single co-ordinate system),
each pair $(\Sigma_j, h_j)$ may be used as the statistic for a rectangular
distribution confined to the half-space indicated by the eigenvector
corresponding to the maximum eigenvalue of $\Sigma_j$. The only difference
in form of the effective density from the probability densities formulated
in the previous section is that the normalization constant $\kappa$ is
doubled and the density is defined to be zero in the complement to the
half-space just mentioned. At this point, it would be desirable to form
pairwise discriminant hyperplanes between the component densities of two
given distributions with arbitrary isoprobable contour, but it is first
necessary to take into consideration the intersection of the component
distributions. Earlier in this report, it was noted that fuzzy set theory
offers a conservative means of estimating the probability of the union of
sets by disregarding some information. Here, however, it is possible to
take all available information into account and at the same time not assume
the computational burden of explicitly integrating over the intersection
contour. To do this, consider the set probabilities
$P_1(A_1; \Sigma_1)$, $P_2(A_2; \Sigma_2)$. Define the set union
probability as


and the set intersection probability as


The rectangular solid corresponding to the intersection covariance matrix
is contained within the convex hull of the intersection of the solids
corresponding to $\Sigma_1$ and $\Sigma_2$. Therefore, the approximation
designated by Equation (82) implies that the estimate of the set union
probability is liberal. Nevertheless, all the available information has
been taken into account. Indeed, one might go a step further at some
computational expense and specify the intersection volume exactly by
simultaneously diagonalizing the covariance matrices and then deriving the
eigenvalues for the intersection covariance matrix as follows:


where the diagonalized matrices have eigenvalues {1, ..., 1} (the identity)
and $\{\lambda_1, \ldots, \lambda_n\}$, respectively, and
$\boldsymbol{\delta}$ is the vector between the set means. If
... $< \delta_i$, the intersection is null. The distribution supported by
the intersection in fact has neither the roll-off nor the rectangular
contour of the component distributions, although the approximation is
explicitly made here to the contrary. The eigenvalues in Equation (83) have
been specified along non-orthogonal co-ordinates, and an inverse
transformation to normal coordinates must be made before the eigenvalues of
the covariance matrix of the presumed intersection distribution can be
calculated. The set union probability implied by Equation (83) has been
specified for distinct means, whereas the means of the component
distributions are in fact coincident. This element of generality has been
purposely introduced to suggest that what has been done is not merely an
approximation, but a modification of the axiom of additivity for the closed
system of distributions with rectangular contour, and thus defines a logic
of such distributions as does the set of Equations (81), (82). It remains
now to


explicate the form of the probability of error. If the eigenvectors of
the intersection covariance matrix are represented as the modal matrix

°T  = [^1*  •••’ 

then  the  error  contribution  due  to  the  component  matrices  is  reduced  by  an 
amount  of  the  form 


w„ 


Eu  = |det 


n / \ //  , , \w0  / s D6-t\  \(  s D6-t  v 


(84) 


due to an intersection distribution with vertex $\boldsymbol{\eta}_i$
centered (with its image) at $\boldsymbol{\delta}$, and under the
constraint Equation (60). The form of Equation (84) is based on the fact
that D represents a rigid rotation which implies that the hyperplane takes
the form $\mathbf{s}^T D (\boldsymbol{\delta} - \mathbf{x}) = t$. The
matrix D is unitary, and therefore the Jacobian det D is unity. When all
components, vertex components, and intersections have been analyzed, one
could formulate a total error similar to Equation (41) or deal with the
distributions on a component pairwise basis so as to structure a piecewise
linear discriminant surface and its accompanying ordering format, a
decision tree. Application of the degenerate approximation discussed at the
beginning of this section would be appropriate to determine the hyperplanes
in either case.

At this point, it might be appropriate to consider the various
$\boldsymbol{\delta}$, $\Sigma$ as random variables and calculate
confidence intervals for the decision surfaces just determined. More
exactly, the expectation and variance of the probability of error should be
calculated as a function of the linear classifier. The classifier, in turn,
is a function of the sample means and covariance matrices, and therefore of
the number of samples and dimensions. Foley14 has performed such an
analysis for the Fisher linear discriminant and underlying identical
multivariate normally distributed densities. In the present case, however,
an explicit form for the discriminant is not available, although some
further approximation to Equation (70) may be possible. In lieu of an
analytic solution, it is possible to acquire some understanding of the
expected error and its variance by a numerical method which exploits a
variation of the so-called U or "hold-one-out" rule. The rule states that
the hyperplane $\mathbf{s}$, $t$ is designed on the basis of N-1 samples
and tested on the remaining sample. Different hyperplanes are then
successively designed for each sample held out. Each hyperplane now
represents a value of the probability of error, as calculated on the basis
of techniques described earlier in this report. Thus, one has a means not
only of calculating the mean and variance of the probability of error but
also of the hyperplane coefficients themselves. Hence, confidence intervals
may be computed if one chooses to presume that the samples of the
probability of error are normally distributed or indeed distributed in a
way that could be calculated by the functions described in this report.
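
The hold-one-out procedure can be sketched in a few lines. The design rule
used below (a plane bisecting the two sample means) is only a hypothetical
placeholder for the discriminant construction described in this report, and
the sketch counts misclassified held-out samples instead of evaluating the
analytic probability of error for each hyperplane; both are assumptions of
this example.

```python
import statistics

def design_plane(class1, class2):
    """Hypothetical placeholder rule: plane bisecting the two sample means."""
    n = len(class1[0])
    m1 = [statistics.fmean([x[k] for x in class1]) for k in range(n)]
    m2 = [statistics.fmean([x[k] for x in class2]) for k in range(n)]
    s = [b - a for a, b in zip(m1, m2)]
    t = sum(sk * (a + b) / 2.0 for sk, a, b in zip(s, m1, m2))
    return s, t

def hold_one_out(class1, class2):
    """Design on N-1 samples, test on the one held out, for every sample.

    Returns (error_rate, planes), where planes holds the N designed (s, t).
    """
    labelled = [(x, 1) for x in class1] + [(x, 2) for x in class2]
    errors, planes = 0, []
    for idx, (x, label) in enumerate(labelled):
        rest = labelled[:idx] + labelled[idx + 1:]
        c1 = [p for p, lab in rest if lab == 1]
        c2 = [p for p, lab in rest if lab == 2]
        s, t = design_plane(c1, c2)
        planes.append((s, t))
        predicted = 2 if sum(sk * xk for sk, xk in zip(s, x)) > t else 1
        errors += predicted != label
    return errors / len(labelled), planes

if __name__ == "__main__":
    c1 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    c2 = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]]
    rate, planes = hold_one_out(c1, c2)
    thresholds = [t for _, t in planes]
    # Spread of the designed thresholds across the held-out designs.
    print(rate, statistics.fmean(thresholds), statistics.pvariance(thresholds))
```

The list of designed planes is what allows the mean and variance of the
hyperplane coefficients themselves to be estimated.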


SUMMARY  AND  CONCLUSION 

A methodology has been presented in this paper for the description of
probability density distributions with concentric rectangular isoprobable
contour, associated first and second order statistics, and arbitrary
roll-off. Included in the description are derivations of the normalization
constant and the expression for the probability of error with respect to
an arbitrary but fixed hyperplane. A piecewise linear approximation to
the solution of the binary classification problem for a class of
rectangular distributions with bounded support and monotonically decreasing
roll-off was derived on the basis of error computation along decoupled
co-ordinate axes. The paper then discussed the approximate decomposition of
overlapping rectangular distributions. An approximation to the error
increment contributed by the overlap (which would constitute
double-counting if not subtracted from the total error) was constructed.
The approximation, in its general formulation, suggests a modification to
the axiom of additivity which nevertheless takes into consideration all the
information available in a reasonable fashion while still striving for
computational efficiency. Finally, a procedure was discussed for estimating
the expected probability of error and its variance.

The results just noted make possible the generation of a
computationally efficient discriminant: namely, a piecewise linear surface
(a tree of hyperplanes) together with the explicit, closed-form calculation
of the probability of error and its mean and variance for a wide class of
densities without unduly restricting assumptions on their forms. On the
other hand, the actual computation of the classifier may still be a
prodigious task. Although generally it is presumed that this task will be
done only once, a slowly changing environment or the accumulation of more
samples may require updating the statistics and hence the discriminating
surface. The techniques discussed in this report do not lend themselves
well to such an update, and other techniques, such as sequential
classification, must be pursued.

In conclusion, the concept postulated is that a discriminant surface
can be expected to have extrapolative power only if it is stable in the
face of new samples. In the case of parametric models - like those
presented in this report - such stability is realized only at the
computational cost of gathering enough data to analyze the fine density
structure. However, in designing approximations to reduce costs, one
need not sacrifice accuracy, even if the approximations do not adhere to
the traditional axioms of probability, provided they are reasonable
exploitations of the data.
APPENDIX A

OPEN  FORM  SOLUTION  TO  THE  HINGE  PROBLEM* 


If f and g are uncorrelated, Equations (10) and (11) may be rewritten
so as to seek the maximum of

1 

tr  W (R(£)+R(f)-5 

v] 

(A1)

over W, s such that

1 

tr  W (R(f)+R(^)-| 

Y = k 

J 

(A2) 

L 

Define  the  diagonalizations 

EUET  R(&)  + Rd>  ” | 

Y 

(A3) 

CVCT  = R(f)  + R(&)  " | 

Y 

(A4) 

$B^{T} \Omega B = V^{1/2} C^{T} W C V^{1/2}$    (A5)

where U, V, Ω are diagonal and E, C, B are orthonormal. Then if V is
positive definite, Equations (A1) and (A2) may be rewritten so as to
maximize

$\sum_{i} \omega_{ii} Q_{ii}$    (A6)

over Ω, such that

$\sum_{i} \omega_{ii} = k$    (A7)

and

$Q = B V^{-1/2} C^{T} E U E^{T} C V^{-1/2} B^{T}$    (A8)

If $\max_{i} Q_{ii} = Q_{jj}$, then Equations (A6) and (A7) have the
open solution

$\omega_{ii} = \begin{cases} k, & i = j \\ 0, & \text{otherwise} \end{cases}$    (A9)

which still leaves the task of maximizing over B, U, V. Equation (A8)
implies that the task is equivalent to seeking the maximum eigenvalue
of $V^{-1/2} C^{T} E U E^{T} C V^{-1/2}$, or equivalently the maximum
eigenvalue of the matrix in Equation (12) over s.
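
The reduction in Equations (A5) through (A9) can be checked numerically.
The following is a minimal sketch only, assuming NumPy is available. The
function name hinge_open_solution and the arguments M and N are
hypothetical stand-ins, not names from this report: M plays the role of
the symmetric matrix diagonalized in (A3), N that of the positive
definite matrix diagonalized in (A4). The sketch places the whole weight
k on a single diagonal entry of Ω, as in (A9), and confirms that the
objective then equals k times the largest eigenvalue named in the text.

```python
import numpy as np

def hinge_open_solution(M, N, k=1.0):
    """Maximize tr(W M) subject to tr(W N) = k for W of the form (A5).

    M stands in for E U E^T of (A3) and N for C V C^T of (A4); both
    argument names are illustrative assumptions.
    """
    v, C = np.linalg.eigh(N)           # N = C diag(v) C^T
    if np.any(v <= 0.0):
        raise ValueError("N must be positive definite (V > 0)")
    T = C / np.sqrt(v)                 # C V^{-1/2}: column j scaled by v_j^{-1/2}
    S = T.T @ M @ T                    # V^{-1/2} C^T M C V^{-1/2}
    lam, vecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    b = vecs[:, -1]                    # eigenvector of the largest eigenvalue
    w = T @ b
    W = k * np.outer(w, w)             # Omega has the single nonzero entry k
    return k * lam[-1], W

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
H = rng.standard_normal((4, 4))
M = G @ G.T                            # symmetric positive semidefinite
N = H @ H.T + 4.0 * np.eye(4)          # symmetric positive definite
best, W = hinge_open_solution(M, N, k=2.0)
print(np.trace(W @ N))                 # constraint value, should be close to k
print(np.trace(W @ M) - best)          # should be close to zero
```

The construction never forms B explicitly: choosing the row of B that
carries the weight k to be the leading eigenvector is all that (A9)
requires, so the remaining rows are left unspecified in this sketch.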


*This is a generalization and a more elegant statement of the problem
and proof given by Sebestyen, p. 44.
APPENDIX B

APPROXIMATION  OF  ELLIPSOIDAL  NORMAL  DENSITY 
BY  RECTANGULAR 


The approximation chosen is that equiprobable contours at the same
density value contain the same volume.

The volume of an ellipsoid is

$\frac{\pi^{n/2}}{\Gamma(1+n/2)} \, 2^{n/2} \, \det A^{-1/2} \left[ \ln\left( (2\pi)^{-n/2} \det A^{1/2} \right) \right]^{n/2}$    (B1)

The volume of a rectangular solid is

$2^{3n/2} \, \det M^{-1/2} \left[ \ln\left( 2^{-3n/2} \det M^{1/2} / \Gamma(1+n/2) \right) \right]^{n/2}$    (B2)

The bracketed quantities in Equations (B1) and (B2) are equal if

$\det M^{1/2} = (4/\pi)^{n/2} \, \Gamma(1+n/2) \, \det A^{1/2}$    (B3)

in which case the unbracketed quantities are also equal. Therefore,
Equation (B3) effectively defines a class of transformations which
produce the desired approximation.
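
The volume matching can be verified with a few lines of arithmetic. The
sketch below is illustrative only and uses the Python standard library.
The function names and the level parameter are assumptions introduced
here: level is the density value at which the contour is taken, and
level = 1 corresponds to the formulas as printed in (B1) and (B2).
Equating the prefactors of (B1) and (B2) directly gives the scale factor
$(4/\pi)^{n/2} \Gamma(1+n/2)$ used in matching_det_M_half.

```python
import math

def ellipsoid_contour_volume(n, det_A_half, level=1.0):
    """Volume inside the normal-density contour at `level`, as in (B1)."""
    peak = (2.0 * math.pi) ** (-n / 2) * det_A_half   # density at the origin
    if level >= peak:
        raise ValueError("level must lie below the peak density")
    unit_ball = math.pi ** (n / 2) / math.gamma(1 + n / 2)
    return unit_ball * 2.0 ** (n / 2) / det_A_half * math.log(peak / level) ** (n / 2)

def rectangular_contour_volume(n, det_M_half, level=1.0):
    """Volume inside the rectangular-density contour at `level`, as in (B2)."""
    peak = 2.0 ** (-1.5 * n) * det_M_half / math.gamma(1 + n / 2)
    if level >= peak:
        raise ValueError("level must lie below the peak density")
    return 2.0 ** (1.5 * n) / det_M_half * math.log(peak / level) ** (n / 2)

def matching_det_M_half(n, det_A_half):
    """det M^(1/2) that makes the two contour volumes agree."""
    return (4.0 / math.pi) ** (n / 2) * math.gamma(1 + n / 2) * det_A_half

n, det_A_half = 3, 100.0
det_M_half = matching_det_M_half(n, det_A_half)
print(ellipsoid_contour_volume(n, det_A_half))
print(rectangular_contour_volume(n, det_M_half))   # agrees with the line above
```

For n = 1 the ellipsoid and the rectangular solid are both intervals, so
the matching condition should leave the determinant unchanged;
matching_det_M_half(1, x) returns x up to rounding, which is a quick
sanity check on the scale factor.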